{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/evaluation/Cleanlab.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Trustworthy RAG with LlamaIndex and Cleanlab\n",
    "\n",
    "LLMs occasionally hallucinate incorrect answers, especially for questions not well-supported within their training data. While organizations are adopting Retrieval Augmented Generation (RAG) to power LLMs with proprietary data, incorrect RAG responses remain a problem.\n",
    "\n",
    "This tutorial shows how to build **trustworthy** RAG applications: use [Cleanlab](https://help.cleanlab.ai/tlm/) to score the trustworthiness of every LLM response, and diagnose *why* responses are untrustworthy via evaluations of specific RAG components.\n",
    "\n",
    "Powered by [state-of-the-art uncertainty estimation](https://cleanlab.ai/blog/trustworthy-language-model/), Cleanlab trustworthiness scores help you automatically catch incorrect responses from any LLM application. Trust scoring happens in real-time and does not require any data labeling or model training work. Cleanlab provides additional real-time Evals for specific RAG components like the retrieved context, which help you root cause *why* RAG responses were incorrect. Cleanlab makes it easy to prevent inaccurate responses from your RAG app, and avoid losing your users' trust."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "This tutorial requires a:\n",
    "- Cleanlab API Key: Sign up at [tlm.cleanlab.ai/](https://tlm.cleanlab.ai/) to get a free key\n",
    "- OpenAI API Key: To make completion requests to an LLM\n",
    "\n",
    "Start by installing the required dependencies. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index cleanlab-tlm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os, re\n",
    "from typing import List, ClassVar\n",
    "import pandas as pd\n",
    "\n",
    "from llama_index.llms.openai import OpenAI\n",
    "from llama_index.embeddings.openai import OpenAIEmbedding\n",
    "\n",
    "from cleanlab_tlm import TrustworthyRAG, Eval, get_default_evals"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize the OpenAI client using its API key."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ[\"OPENAI_API_KEY\"] = \"<your-openai-api-key>\"\n",
    "\n",
    "llm = OpenAI(model=\"gpt-4o-mini\")\n",
    "embed_model = OpenAIEmbedding(embed_batch_size=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we initialize Cleanlab's client with default configurations. You can achieve better detection accuracy and latency by adjusting [optional configurations](https://help.cleanlab.ai/tlm/tutorials/tlm_advanced/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ[\"CLEANLAB_TLM_API_KEY\"] = \"<your-cleanlab-api-key\"\n",
    "\n",
    "trustworthy_rag = (\n",
    "    TrustworthyRAG()\n",
    ")  # Optional configurations can improve accuracy/latency"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read data\n",
    "\n",
    "This tutorial uses Nvidia’s Q1 FY2024 earnings report as an example data source for populating the RAG application's knowledge base."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2025-05-07 16:13:28--  https://cleanlab-public.s3.amazonaws.com/Datasets/NVIDIA_Financial_Results_Q1_FY2024.md\n",
      "Resolving cleanlab-public.s3.amazonaws.com (cleanlab-public.s3.amazonaws.com)... 54.231.236.193, 16.182.70.65, 52.217.14.204, ...\n",
      "Connecting to cleanlab-public.s3.amazonaws.com (cleanlab-public.s3.amazonaws.com)|54.231.236.193|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 7379 (7.2K) [binary/octet-stream]\n",
      "Saving to: ‘NVIDIA_Financial_Results_Q1_FY2024.md’\n",
      "\n",
      "NVIDIA_Financial_Re 100%[===================>]   7.21K  --.-KB/s    in 0s      \n",
      "\n",
      "2025-05-07 16:13:28 (97.7 MB/s) - ‘NVIDIA_Financial_Results_Q1_FY2024.md’ saved [7379/7379]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/NVIDIA_Financial_Results_Q1_FY2024.md'\n",
    "!mkdir -p ./data\n",
    "!mv NVIDIA_Financial_Results_Q1_FY2024.md data/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# NVIDIA Announces Financial Results for First Quarter Fiscal 2024\n",
      "\n",
      "NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago \n"
     ]
    }
   ],
   "source": [
    "with open(\n",
    "    \"data/NVIDIA_Financial_Results_Q1_FY2024.md\", \"r\", encoding=\"utf-8\"\n",
    ") as file:\n",
    "    data = file.read()\n",
    "\n",
    "print(data[:200])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build a RAG pipeline\n",
    "\n",
    "Now let's build a simple RAG pipeline with LlamaIndex. We have already initialized the OpenAI API for both as LLM and Embedding model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader\n",
    "\n",
    "Settings.llm = llm\n",
    "Settings.embed_model = embed_model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load Data and Create Index + Query Engine\n",
    "\n",
    "Let's create an index from the document we just pulled above. We stick with the default index from LlamaIndex for this tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents = SimpleDirectoryReader(\"data\").load_data()\n",
    "# Optional step since we're loading just one data file\n",
    "for doc in documents:\n",
    "    doc.excluded_llm_metadata_keys.append(\n",
    "        \"file_path\"\n",
    "    )  # file_path wouldn't be a useful metadata to add to LLM's context since our datasource contains just 1 file\n",
    "index = VectorStoreIndex.from_documents(documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The generated index is used to power a query engine over the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query_engine = index.as_query_engine()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that Cleanlab is agnostic to the index and the query engine used for RAG, and is compatible with any choices you make for these components of your system.\n",
    "\n",
    "In addition, you can just use Cleanlab in an existing custom-built RAG pipeline (using any other LLM generator, streaming or not). <br>\n",
    "Cleanlab just needs the prompt sent to your LLM (including system instructions, retrieved context, user query, etc.) and the generated response.\n",
    "\n",
    "We define an event handler that stores the prompt that LlamaIndex sends to the LLM. Refer to the [instrumentation documentation](https://docs.llamaindex.ai/en/stable/examples/instrumentation/basic_usage/) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.instrumentation import get_dispatcher\n",
    "from llama_index.core.instrumentation.events import BaseEvent\n",
    "from llama_index.core.instrumentation.event_handlers import BaseEventHandler\n",
    "from llama_index.core.instrumentation.events.llm import LLMPredictStartEvent\n",
    "\n",
    "\n",
    "class PromptEventHandler(BaseEventHandler):\n",
    "    events: ClassVar[List[BaseEvent]] = []\n",
    "    PROMPT_TEMPLATE: str = \"\"\n",
    "\n",
    "    @classmethod\n",
    "    def class_name(cls) -> str:\n",
    "        return \"PromptEventHandler\"\n",
    "\n",
    "    def handle(self, event) -> None:\n",
    "        if isinstance(event, LLMPredictStartEvent):\n",
    "            self.PROMPT_TEMPLATE = event.template.default_template.template\n",
    "            self.events.append(event)\n",
    "\n",
    "\n",
    "# Root dispatcher\n",
    "root_dispatcher = get_dispatcher()\n",
    "\n",
    "# Register event handler\n",
    "event_handler = PromptEventHandler()\n",
    "root_dispatcher.add_event_handler(event_handler)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For each query, we can fetch the prompt from `event_handler.PROMPT_TEMPLATE`. Let's see it in action."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use our RAG application\n",
    "\n",
    "Now that the vector database is loaded with text chunks and their corresponding embeddings, we can start querying it to answer questions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NVIDIA's total revenue in the first quarter of fiscal 2024 was $7.19 billion.\n"
     ]
    }
   ],
   "source": [
    "query = \"What was NVIDIA's total revenue in the first quarter of fiscal 2024?\"\n",
    "\n",
    "response = query_engine.query(query)\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This response is indeed correct for our simple query. Let's see the document chunks that LlamaIndex retrieved for this query, from which we can easy verify this response was right."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_retrieved_context(response, print_chunks=False):\n",
    "    if isinstance(response, list):\n",
    "        texts = [node.text for node in response]\n",
    "    else:\n",
    "        texts = [src.node.text for src in response.source_nodes]\n",
    "\n",
    "    if print_chunks:\n",
    "        for idx, text in enumerate(texts):\n",
    "            print(f\"--- Chunk {idx + 1} ---\\n{text[:200]}...\")\n",
    "    return \"\\n\".join(texts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--- Chunk 1 ---\n",
      "# NVIDIA Announces Financial Results for First Quarter Fiscal 2024\n",
      "\n",
      "NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago ...\n",
      "--- Chunk 2 ---\n",
      "- **Gross Margins**: GAAP and non-GAAP gross margins are expected to be 68.6% and 70.0%, respectively, plus or minus 50 basis points.\n",
      "- **Operating Expenses**: GAAP and non-GAAP operating expenses are...\n"
     ]
    }
   ],
   "source": [
    "context_str = get_retrieved_context(response, True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Add a Trust Layer with Cleanlab\n",
    "\n",
    "Let's add a detection layer to flag untrustworthy RAG responses in real-time. TrustworthyRAG runs Cleanlab's state-of-the-art uncertainty estimator, the [Trustworthy Language Model](https://cleanlab.ai/tlm/), to provide a **trustworthiness score** indicating overall confidence that your RAG's response is *correct*. \n",
    "\n",
    "To diagnose *why* responses are untrustworthy, TrustworthyRAG can run additional evaluations of specific RAG components. Let's see what Evals it runs by default:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "context_sufficiency\n",
      "response_groundedness\n",
      "response_helpfulness\n",
      "query_ease\n"
     ]
    }
   ],
   "source": [
    "default_evals = get_default_evals()\n",
    "for eval in default_evals:\n",
    "    print(f\"{eval.name}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each Eval returns a score between 0-1 (higher is better) that assesses a different aspect of your RAG system:\n",
    "\n",
    "1. **context_sufficiency**: Evaluates whether the retrieved context contains sufficient information to completely answer the query. A low score indicates that key information is missing from the context (perhaps due to poor retrieval or missing documents).\n",
    "\n",
    "2. **response_groundedness**: Evaluates whether claims/information stated in the response are explicitly supported by the provided context.\n",
    "\n",
    "3. **response_helpfulness**: Evaluates whether the response attempts to answer the user query in a helpful manner.\n",
    "\n",
    "4. **query_ease**: Evaluates whether the user query seems easy for an AI system to properly handle. Complex, vague, tricky, or disgruntled-sounding queries receive lower scores.\n",
    "\n",
    "To run TrustworthyRAG, we need the prompt sent to the LLM, which includes the system message, retrieved chunks, the user's query, and the LLM's response.\n",
    "The event handler defined above provides this prompt.\n",
    "Let's define a helper function to run Cleanlab's detection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Helper function to run real-time Evals\n",
    "def get_eval(query, response, event_handler, evaluator):\n",
    "    # Get context used by LLM to generate response\n",
    "    context = get_retrieved_context(response)\n",
    "    # Get prompt template used to build the prompt\n",
    "    pt = event_handler.PROMPT_TEMPLATE\n",
    "    # Build prompt\n",
    "    full_prompt = pt.format(context_str=context, query_str=query)\n",
    "\n",
    "    eval_result = evaluator.score(\n",
    "        query=query,\n",
    "        context=context,\n",
    "        response=response.response,\n",
    "        prompt=full_prompt,\n",
    "    )\n",
    "    # Evaluate the response using TrustworthyRAG\n",
    "    print(\"### Evaluation results:\")\n",
    "    for metric, value in eval_result.items():\n",
    "        print(f\"{metric}: {value['score']}\")\n",
    "\n",
    "\n",
    "# Helper function run end-to-end RAG\n",
    "def get_answer(query, evaluator=trustworthy_rag, event_handler=event_handler):\n",
    "    response = query_engine.query(query)\n",
    "\n",
    "    print(\n",
    "        f\"### Query:\\n{query}\\n\\n### Trimmed Context:\\n{get_retrieved_context(response)[:300]}...\"\n",
    "    )\n",
    "    print(f\"\\n### Generated response:\\n{response.response}\\n\")\n",
    "\n",
    "    get_eval(query, response, event_handler, evaluator)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### Evaluation results:\n",
      "trustworthiness: 1.0\n",
      "context_sufficiency: 0.9975124377856721\n",
      "response_groundedness: 0.9975124378045552\n",
      "response_helpfulness: 0.9975124367363073\n",
      "query_ease: 0.9975071027792313\n"
     ]
    }
   ],
   "source": [
    "get_eval(query, response, event_handler, trustworthy_rag)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Analysis:** The high `trustworthiness_score` indicates this response is very trustworthy, i.e. non-hallucinated and likely correct. The context that was retrieved here is sufficient to answer this query, as reflected by the high `context_sufficiency` score. The high `query_ease` score indicates this is a straightforward query as well."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let’s run a *challenging* query that **cannot** be answered using the only document in our RAG application's knowledge base."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### Query:\n",
      "How does the report explain why NVIDIA's Gaming revenue decreased year over year?\n",
      "\n",
      "### Trimmed Context:\n",
      "# NVIDIA Announces Financial Results for First Quarter Fiscal 2024\n",
      "\n",
      "NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter.\n",
      "\n",
      "- **Quarterly revenue** of $7.19 billion, up 19% from the pre...\n",
      "\n",
      "### Generated response:\n",
      "The report indicates that NVIDIA's Gaming revenue decreased year over year by 38%, which is attributed to a combination of factors, although specific reasons are not detailed. The context highlights that the revenue for the first quarter was $2.24 billion, down from the previous year, while it did show an increase of 22% from the previous quarter. This suggests that while there may have been a seasonal or cyclical recovery, the overall year-over-year decline reflects challenges in the gaming segment during that period.\n",
      "\n",
      "### Evaluation results:\n",
      "trustworthiness: 0.8018049078305449\n",
      "context_sufficiency: 0.26134514055082803\n",
      "response_groundedness: 0.8147481620994604\n",
      "response_helpfulness: 0.28647897539109127\n",
      "query_ease: 0.952132218665045\n"
     ]
    }
   ],
   "source": [
    "get_answer(\n",
    "    \"How does the report explain why NVIDIA's Gaming revenue decreased year over year?\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Analysis:** The generator LLM avoids conjecture by providing a reliable response, as seen in the high `trustworthiness_score`. The low `context_sufficiency` score reflects that the retrieved context was lacking, and the response doesn’t actually answer the user’s query, as indicated by the low `response_helpfulness`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let’s see how our RAG system responds to another *challenging* question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### Query:\n",
      "How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?\n",
      "\n",
      "### Trimmed Context:\n",
      "# NVIDIA Announces Financial Results for First Quarter Fiscal 2024\n",
      "\n",
      "NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter.\n",
      "\n",
      "- **Quarterly revenue** of $7.19 billion, up 19% from the pre...\n",
      "\n",
      "### Generated response:\n",
      "NVIDIA's revenue decreased by $1.10 billion this quarter compared to the last quarter.\n",
      "\n",
      "### Evaluation results:\n",
      "trustworthiness: 0.572441384819641\n",
      "context_sufficiency: 0.9974990573223977\n",
      "response_groundedness: 0.006136548076912901\n",
      "response_helpfulness: 0.997512230771839\n",
      "query_ease: 0.8018484929561781\n"
     ]
    }
   ],
   "source": [
    "get_answer(\n",
    "    \"How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Analysis**: The generated response incorrectly states that NVIDIA's revenue decreased this quarter, when in fact the referenced report notes a 19% increase quarter-over-quarter. \n",
    "\n",
    "Cleanlab's low trustworthiness score helps us automatically catch this incorrect RAG response in real-time!  To root-cause why this response was untrustworthy, we see the `response_groundedness` score is low, which indicates our LLM model is to blame for fabricating this false information. \n",
    "\n",
    "Let's try another one:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### Query:\n",
      "If NVIDIA's Data Center segment maintains its Q1 FY2024 quarter-over-quarter growth rate for the next four quarters, what would be its projected annual revenue?\n",
      "\n",
      "### Trimmed Context:\n",
      "# NVIDIA Announces Financial Results for First Quarter Fiscal 2024\n",
      "\n",
      "NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter.\n",
      "\n",
      "- **Quarterly revenue** of $7.19 billion, up 19% from the pre...\n",
      "\n",
      "### Generated response:\n",
      "If NVIDIA's Data Center segment maintains its quarter-over-quarter growth rate of 18% from Q1 FY2024 for the next four quarters, the projected revenue for the next four quarters can be calculated as follows:\n",
      "\n",
      "1. Q1 FY2024 revenue: $4.28 billion\n",
      "2. Q2 FY2024 projected revenue: $4.28 billion * 1.18 = $5.05 billion\n",
      "3. Q3 FY2024 projected revenue: $5.05 billion * 1.18 = $5.95 billion\n",
      "4. Q4 FY2024 projected revenue: $5.95 billion * 1.18 = $7.02 billion\n",
      "5. Q5 FY2024 projected revenue: $7.02 billion * 1.18 = $8.27 billion\n",
      "\n",
      "Now, summing these revenues for the five quarters (including Q1 FY2024):\n",
      "\n",
      "- Total projected revenue = $4.28 billion + $5.05 billion + $5.95 billion + $7.02 billion + $8.27 billion = $30.57 billion\n",
      "\n",
      "Therefore, the projected annual revenue for the Data Center segment would be approximately $30.57 billion.\n",
      "\n",
      "### Evaluation results:\n",
      "trustworthiness: 0.23124932848015411\n",
      "context_sufficiency: 0.9299227307108295\n",
      "response_groundedness: 0.31247206392894905\n",
      "response_helpfulness: 0.9975055879546202\n",
      "query_ease: 0.7724662723193096\n"
     ]
    }
   ],
   "source": [
    "get_answer(\n",
    "    \"If NVIDIA's Data Center segment maintains its Q1 FY2024 quarter-over-quarter growth rate for the next four quarters, what would be its projected annual revenue?\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Analysis**: Reviewing the generated response, we find it overstates (sums up the financials of Q1) the projected revenue. Again Cleanlab helps us automatically catch this incorrect response via its low `trustworthiness_score`.  Based on the additional Evals, the root cause of this issue again appears to be the LLM model failing to ground its response in the retrieved context."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Custom Evals\n",
    "\n",
    "You can also specify custom evaluations to assess specific criteria, and combine them with the default evaluations for comprehensive/tailored assessment of your RAG system.\n",
    "\n",
    "For instance, here's how to create and run a custom eval that checks the conciseness of the generated response."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "conciseness_eval = Eval(\n",
    "    name=\"response_conciseness\",\n",
    "    criteria=\"Evaluate whether the Generated response is concise and to the point without unnecessary verbosity or repetition. A good response should be brief but comprehensive, covering all necessary information without extra words or redundant explanations.\",\n",
    "    response_identifier=\"Generated Response\",\n",
    ")\n",
    "\n",
    "# Combine default evals with a custom eval\n",
    "combined_evals = get_default_evals() + [conciseness_eval]\n",
    "\n",
    "# Initialize TrustworthyRAG with combined evals\n",
    "combined_trustworthy_rag = TrustworthyRAG(evals=combined_evals)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### Query:\n",
      "What significant transitions did Jensen comment on?\n",
      "\n",
      "### Trimmed Context:\n",
      "# NVIDIA Announces Financial Results for First Quarter Fiscal 2024\n",
      "\n",
      "NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter.\n",
      "\n",
      "- **Quarterly revenue** of $7.19 billion, up 19% from the pre...\n",
      "\n",
      "### Generated response:\n",
      "Jensen Huang commented on the significant transitions the computer industry is undergoing, particularly in the areas of accelerated computing and generative AI.\n",
      "\n",
      "### Evaluation results:\n",
      "trustworthiness: 0.9810004109697261\n",
      "context_sufficiency: 0.9902170786836257\n",
      "response_groundedness: 0.9975123614036665\n",
      "response_helpfulness: 0.9420916924086002\n",
      "query_ease: 0.5334109647649754\n",
      "response_conciseness: 0.842668665703559\n"
     ]
    }
   ],
   "source": [
    "get_answer(\n",
    "    \"What significant transitions did Jensen comment on?\",\n",
    "    evaluator=combined_trustworthy_rag,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Replace your LLM with Cleanlab's\n",
    "\n",
    "Beyond evaluating responses already generated from your LLM, Cleanlab can also generate responses and evaluate them simultaneously (using one of many [supported models](https://help.cleanlab.ai/tlm/api/python/tlm/#class-tlmoptions)). <br />\n",
    "You can do this by calling `trustworthy_rag.generate(query=query, context=context, prompt=full_prompt)` <br />\n",
    "This replaces your own LLM within your RAG system and can be more convenient/accurate/faster.\n",
    "\n",
    "Let's replace our OpenAI LLM to call Cleanlab's endpoint instead:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### Query:\n",
      "How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?\n",
      "\n",
      "### Trimmed Context:\n",
      "# NVIDIA Announces Financial Results for First Quarter Fiscal 2024\n",
      "\n",
      "NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter.\n",
      "\n",
      "- **Quarterly revenue** of $7.19 billion, up 19% from the pre\n",
      "\n",
      "### Generated Response:\n",
      "NVIDIA's revenue for the first quarter of fiscal 2024 was $7.19 billion, and for the previous quarter (Q4 FY23), it was $6.05 billion. Therefore, the revenue increased by $1.14 billion from the previous quarter, not decreased. \n",
      "\n",
      "So, the revenue did not decrease this quarter vs last quarter; it actually increased by $1.14 billion.\n",
      "\n",
      "### Evaluation Scores:\n",
      "trustworthiness: 0.6810414232214796\n",
      "context_sufficiency: 0.9974887437375295\n",
      "response_groundedness: 0.9975116791816968\n",
      "response_helpfulness: 0.3293002430120912\n",
      "query_ease: 0.33275910932109172\n"
     ]
    }
   ],
   "source": [
    "query = \"How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?\"\n",
    "relevant_chunks = query_engine.retrieve(query)\n",
    "context = get_retrieved_context(relevant_chunks)\n",
    "print(f\"### Query:\\n{query}\\n\\n### Trimmed Context:\\n{context[:300]}\")\n",
    "\n",
    "pt = event_handler.PROMPT_TEMPLATE\n",
    "full_prompt = pt.format(context_str=context, query_str=query)\n",
    "\n",
    "result = trustworthy_rag.generate(\n",
    "    query=query, context=context, prompt=full_prompt\n",
    ")\n",
    "print(f\"\\n### Generated Response:\\n{result['response']}\\n\")\n",
    "print(\"### Evaluation Scores:\")\n",
    "for metric, value in result.items():\n",
    "    if metric != \"response\":\n",
    "        print(f\"{metric}: {value['score']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While it remains hard to achieve a RAG application that will accurately answer *any* possible question, you can easily use Cleanlab to deploy a *trustworthy* RAG application which at least flags answers that are likely inaccurate.  Learn more about optional configurations you can adjust to improve accuracy/latency in the [Cleanlab documentation](https://help.cleanlab.ai/tlm/)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "cl",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
